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Abstract 

Slow Feature Analysis (SFA) extracts features representing the underlying causes of changes within a 
temporally coherent high-dimensional raw sensory input signal. Our novel incremental version of SFA 
(IncSFA) combines incremental Principal Components Analysis and Minor Components Analysis. Unlike 
standard batch-based SFA, IncSFA adapts along with non-stationary environments, is amenable to episodic 
training, is not corrupted by outliers, and is covariance-free. These properties make IncSFA a generally 
useful unsupervised preprocessor for autonomous learning agents and robots. In IncSFA, the CCIPCA and 
MCA updates take the form of Hebbian and anti-Hebbian updating, extending the biological plausibility 
of SFA. In both single node and deep network versions, IncSFA learns to encode its input streams (such 
as high-dimensional video) by informative slow features representing meaningful abstract environmental 
properties. It can handle cases where batch SFA fails. 



1 Introduction 



Slow feature analysis (Wiskott and Sejnowski , 2002 ; Wiskott et al. 201 l| l(SFA) is an unsupervised learning 
technique that extracts features from an input stream with the objective of maintaining an informative 
but slowly-changing feature response over time. The idea of using temporal stability as an objective in 



learning systems has motivated some other unsupervised learning techniques (Hinton 1989 Foldiak 1991 



Mitchison 1991 Schmidhuber 1992a Bergstra and Bengio 2009 1. SFA is distinguished by its formulation 
of the feature extraction problem as an eigensystem problem, which guarantees that its solution methods 
reliably converge to the best solution, given its constraints (no local minima problem). SFA has shown 
success in problems such as extraction of driving forces of a dynamical system (Wiskott 2003) >, nonlinear 
blind source separation (Sprekeler et al. 2010[ ), as a preprocessor for reinforcement learning (Legenstein 



et al. 2010 Kompella et al. 201 lb I, and learning of place-cells, head-direction cells, grid-cells, and spatial 
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view cells from high-dimensional visual input ( Franzius et al. 2007 1 — such representations also exist in 



biological agents ( jO'Keefe and Dostrovsky|[T97T]|Taube eTal^[l^^[RoiIsl[T999||Hafting et al.||2005} . 

There are limitations to existing SFA implementations due to their batch processing nature, which 
becomes especially apparent when attempting to apply it in somewhat uncontrolled environments. To 



overcome these issues, we introduce the new Incremental Slow Feature Analysis (IncSFA) ( Kompella et al 



201 la|b| l. A few earlier techniques with temporal continuity objective were incremental as well (Hinton 



1989, Bergstra and Bengio 2009), but IncSFA follows the SFA formulation and can track solutions of 



batch SFA (BSFA), over which it has the following advantages: 

• Adaptation to changing input statistics. BSFA requires all data to be collected in advance. New 
data cannot be used to modify already learned slow features. Once the input statistics change, IncSFA 
can automatically adapt its features without outside intervention, while BSFA has to discard previous 
features to process the new data. 

In open-ended learning settings, an autonomous agent's lifelong input stream follows such a nonsta- 
tionary distribution. The agent's behavior will typically change over time, thus generating new input 
sequences. Features useful for early behaviors may not be useful for later behavior. 

• Learn features across episodes. Episodic learning is impossible for BSFA, since it cannot handle 
temporal discontinuities at episode boundaries. IncSFA, however, may use the final slow features 
from the previous episode to initialize its features of the next episode. 

• Reduced sensitivity to outliers. Real-world environments typically exhibit infrequent, uncontrolled, 
insignificant external events that should be ignored. BSFA is very sensitive to such events, encoding 
everything that changes slowly within the current batch. IncSFA's plasticity, however, makes it lose 
sensitivity to such events over time. 



• Covariance-free. BSFA techniques rely upon batch Principal Component Analysis ( Jolliffe 1986 ) 
(PCA), which requires the data's covariance matrix. Estimating, storing and/or updating covari- 
ance matrices can be expensive for high-dimensional data and impractical for open-ended learning. 
IncSFA uses covariance-free techniques. For high-dimensional images, the number of parameters 
to estimate in the covariance matrix is huge: n(n + l)/2 for dimension n, while a covariance-free 
technique only requires nxm where m is the desired number of principal components (PCs). For ex- 
ample, 100 x 100 dimensional images lead to 50, 005, 000 free parameters in the covariance matrix, 
which is reduced to only 100 x 100 x m eigenvector parameters with covariance-free updating. 

Furthermore, since often only a relatively small number of principal components are needed to ex- 
plain most of the variance in the data, the other components do not even have to be estimated. With 
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IncSFA, dimensionality reduction can be done during PC estimation; no time needs to be wasted on 
computing the many insignificant lower-order PCs. 

Another technical problem with the covariance matrix: in sequences where only a small part of the 
input changes, computing the principal components of the difference signal's covariance matrix will 
result in singularity errors, since the matrix won't have full rank. 

• Biological Plausibility. IncSFA adds further biological plausibility to SFA. SFA itself is linkable to 
biological systems due to the results in deriving place cell, grid cells, etc., but it is difficult to see how 
BSFA could be realized in the brain. IncSFA's updates, however, can be described in incremental 
Hebbian and anti-Hebbian terms. 

The remainder of this paper is organized as follows. Section [2] reviews SFA and its batch solution. 
Section[3]describes the new incremental SFA. Section [4] details the algorithm and discusses deeper related 
issues, including convergence conditions and parameter setting. Section [5] contains experiments and re- 
sults, and shows how to utilize IncSFA as part of a hierarchical image processing architecture. Section [6] 
concludes the paper. 



2 Background 



2.1 SFA: Intuition 



We first review SFA briefly in an intuitive sense. SFA is a form of unsupervised learning (UL). It searches 
for a set of mappings gifiom data x 6 1Z 1 to output components j/j = <?i(x) that are separate from 
each other in some sense and express information that is in some sense relevant. In SFA separateness is 
realized as decorrelation (like in PCA), while relevance is defined in terms of slowness of change over 
time. Ordering our functions g\, g2, gi by slowness, we can discard all but the J < I slowest, to enable 
dimensionality reduction, getting rid of irrelevant information such as quickly changing noise assumed to 
be useless. The compact relevant data encodings reduce the search space for downstream goal-directed 



learning procedures (Schmidhuber 1999 Barlow 2001). As an example, consider a high-dimensional 
dynamical system: a mobile robot sensing with an onboard camera, where each pixel is considered a 
separate observation component. SFA will use the video sequences to guide its search over functions that 
encode each image into a small set of state variables, and the robot can use these new state variables to 
quickly develop useful controllers. Fig.[TJprovides a visual example of how SFA operates. 

SFA-based UL learns instantaneous features from sequential data ( |Hinton 1989 Wiskott and Se- 
jnowski 2002| Doersch et al. I. Relevance cannot be uncovered without taking time into account, but 



once it is known, each input frame can be processed on its own. SFA differs from both 1 . many well- 
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Figure 1: Intuition of SFA. (A): Consider a zero-mean input signal that spatially resembles white noise. 
Assume the input distributions are Gaussian, shown by the gray area, while the black dots show individual 
data points. Spatial feature extractors such as PCA will not prefer any direction over any other. (B): 
Eschewing unhelpful spatial processing, we examine this input as a time-series; to illustrate here we show 
a short sequence of input. Each difference vector becomes a spatial component in the space shown in (C). 
In this space, the first principal component gives the (linear) direction of quickest change. The second — 
the minor component — gives the direction of slowest change. We see that recoding the data in terms of 
subsequent differences and performing an eigendecomposition provides an ordered set of separate features, 
which are applied to the original input signal. 



known unsupervised feature extractors (Abut 1990 Jolliffe 1986 Comon 1994 Lee and Seung 1999 



|Kohonen] |200 1 1 [Hinton| |2002| l, which ignore dynamics, and 2. Other UL systems that both leam and apply 
features to sequences (Schmidhuber 1992a c b; Lindstadt, 1993 Klapper-Rybicka et al. 2001 Jenkins and 
Mataric, 2004; Lee et al. 2010| Gisslen et al. 201 1 1, thus assuming that the state of the system itself can 



depend on past information. 



2.2 SFA: Formulation 



SFA's optimization problem (Wis kott and Sejnowski| |2002| |Franzius et al.|[2007| ) is formally written as 
follows: 

Given an I-dimensional sequential input signal x(<) = [xi(t), x \ (t)] T , find a set of J instantaneous 
real-valued functions g(x) = [g\ (x), <?,/(x)] T , which together generate a J -dimensional output signal 
y{t) = [yi(t),...,yj(t)] T withyj(t) := gj(x(t)), such that for each j g {!,..., J} 



Aj := A(j/j) := is minimal 



(1) 
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under the constraints 

(Vj) = (zero mean), (2) 

(y|) = 1 (unit variance), (3) 

Vi < j : {UiVj) — (decorrelation and order), (4) 

with (•) and y indicating temporal averaging and the derivative of y, respectively. 

The problem is to find instantaneous functions gj that generate different output signals varying as slowly 
as possible. The constraints (|2| and ([3]l together avoid a trivial constant output solution. The decorrelation 
constraint Q ensures that different functions gj do not code for the same features. 

2.3 Batch SFA 

Solving this learning problem involves non-trivial variational calculus optimization. But it is simplified 
through an eigenvector approach. If the gj are linear combinations of a finite set of nonlinear functions h, 
then 

Vj (t) = ffj -(x(t)) - wj h(x(t)) - wj z(t), (5) 

and the SFA problem now becomes to find weight vectors Wj to minimize the rate of change of the output 
variables, 



A( % ) = <y|> = wj (zz J )w„ (6) 

subject to the constraints (2-4). The slow feature learning problem has become linear on the derivative 
signal z. 

If the functions of h are chosen such that z has unit covariance matrix and zero mean, the three con- 
straints will be fulfilled if and only if the weight vectors Wj are orthonormal. Eq. [6]will be minimized, and 
the orthonormal constraint satisfied, with the set of J normed eigenvectors of (zz T ) with the J smallest 
eigenvalues (for any J < /). 

The BSFA technique practically implements this solution by using batch principal component analysis 
(PCA) (Jolliffe 1986) twice. Referring back to Eq. [6] to select h appropriately, a well-known process 
called whitening (or sphering), is used to map x to a z with zero mean and identity covariance matrix, thus 
decorrelating signal components and scaling them so that there is unit variance along each PC direction. 
Whitening serves as a bandwidth normalization, so that slowness can truly be measured (slower change 
will not simply be due to a low variance direction). Whitening requires the PCs of the input signal (PCA 
#1). The orthonormal basis that minimizes the rate of output change are the minor components - principal 
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components with smallest eigenvalues - in the derivative space. So another PCA (#2) on z yields the slow 
features (eigenvectors) and their order (via eigenvalues). 

3 Incremental SFA 

IncSFA also employs the eigenvector tactic, but may update an existing estimate on any amount of new 
data, even a single data point x(i). A high-level formulation is 



(W(i + 1), 0(t + 1)) = IncSFA(W(t),x(t),9(t)), 



(7) 



where W = (wi, Wj) is the matrix of existing slow feature vector estimates, and 9 contains algorithm 
memory and parameters, which we will discuss later. 

To replace PCA #1, IncSFA needs to do online whitening of input x. We use Candid Covariance-Free 



Incremental (CCI) PCA (Weng et al. 2003 i. CCIPCA incrementally updates both the eigenvectors and 
eigenvalues necessary for whitening, and does not keep an estimate of the covariance matrix. CCIPCA is 
also used to reduce dimensionality. 

Except for low-dimensional derivative signals z, CCIPCA cannot replace PCA #2. It will be unsta- 
ble, since the slow features correspond to the least significant components. Minor Components Analysis 
(MCA) (Oja 1992) incrementally extracts the principal components with the smallest eigenvalues. We use 
Peng's low complexity updating rule (Pen g et aL]|2007| l. Peng proved its convergence even for constant 



learning rates — good for open-ended learning. MCA with sequential addition (Chen et al. 2001 Peng and 
|YT}|2006| ) will extract multiple slow features in parallel. 



3.1 Neural Updating for PC and MC Extraction 

CCIPCA and Peng's MCA are the most appropriate incremental PCA and MCA algorithms for IncSFA. 
To justify these choices, we briefly review the literature on neural networks that perform incremental PCA 
and MCA. 



Well-known incremental PCA algorithms are Oja and Karhunen's Stochastic Gradient Ascent (SGA) (Oja 
[1985]), Sanger's Generalized Hebbian Algorithm (GHA) ( |Sanger| [T989] l, and CCIPCA. They all build on 
the work of Amari (1977) and Oja (1982), who showed that a linear neural unit using Hebbian updating 

However, SGA 



could compute the first principal component of a data set (Amari 



1977 



Oja 



1982 



(1985) builds upon Oja's earlier work, GHA (1989) builds upon SGA, and CCIPCA (2003) builds upon 
GHA. 



Much earlier work of a non-neural network flavor had shown how the first PC, including the eigenvalue could be learned incre- 
tally jKrasuli na|[T970) 
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SGA use Gram-Schmidt Orthonormalization (GSO) to incrementally find the subspace of all princi- 
pal components, but there is no guarantee of finding the components themselves. Sanger used Rreyszig's 



(Kreyszig 1988 ) (1988) (faster/more effective) residual vector method for computing multiple components. 
His provably converging GHA used the residual method for simultaneous computation of all components. 
CCIPCA (Weng et al. 2003) modified GHA to be "candid", meaning it maintained an implicit learning 
rate dependant on the data, greatly increasing the algorithm's efficiency so that it became useful for high- 
dimensional inputs, such as in appearance-based computer vision. This incremental PCA updating method 
is the best of the above for IncSFA. It converges (Zhang and Weng 2001| > to both eigenvectors and eigen- 
values, necessary since whitening requires both. Due to its candidness, potentially difficult learning rate 
"hand-tuning" is minimized. 



As for MCA: Xu et al. (Xu et al. 19921 were the first to show that a linear neural unit equipped 
with anti-Hebbian learning could extract minor components. Oja modified SGA's updating method to an 
anti-Hebbian variant ( |Oja| [1992), and showed how it could converge to the MC subspace. Studying the 
nature of the duality between PC and MC subspaces (Wang and Karhunen, 1996 ; Chen et al. 1998| l, Chen, 
Amari and Lin ( |Chen et"aL| |200l] ) (2001) introduced the sequential addition technique, enabling linear 
networks to efficiently extract multiple MCs simultaneously. Building upon previous MCA algorithms, 
Peng (2007) (Pen g et aL] |2007| l derived the conditions and a learning rule for extracting MCs without 
changing the learning rate. Sequential addition was added to this rule so that multiple MCs could be 
extracted ( |Peng and Yi| |2006| l. We use this MCA updating method since it gives us the actual minor 
components, not just the subspace they span, and it allows for a constant learning rate, which can be quite 
high, leading to a quick reasonable estimate of the true components. 



3.2 CCIPCA Updating 

Given zero-mean data u = x — E [x] , a PC is a normed eigenvector v* of the data covariance matrix E [uu T ] . 
Eigenvalue A* is the variance of the samples along v* . By definition, an eigenvector and eigenvalue satisfy 

E[uu T ]v* = A*v*, (8) 

The set of eigenvectors are orthonormal, and ordered such that > > ••■ > . 
The whitening matrix is generated by multiplying the matrix of principal components V = [vj, .-v^] 
by the diagonal matrix D, where component di : i = After whitening via z(t) = VDu(i), then 

E[zz T ] = /. In IncSFA, we use online estimates of V and D. Both eigenvectors and eigenvalues need to 
be estimated. 

CCIPCA updates V and D from each sample. For inputs u^, the first PC is the expectation of the 
normalized response-weighted inputs. Eq|8]can be rewritten as 



Technical Report No. IDSIA-07-1 1 



8 



Ajv?=E[(u i .vJ)u i ], (9) 
The corresponding incremental updating equation, where A* v* is estimated by Vj(i), is 

~Ui(t)-Vi(t-l) 



v i (t) = (i-7 ? ) Vi (t-i) + n 



u t (t) 



(10) 



|Vi(t-l) 

where 77 is the learning rate. In other words, both the eigenvector and eigenvalue of the first PC of 
can be found through the sample mean-type updating in Eq. [9] The estimate of the eigenvalue is given by 
\ = ||vi(i)||. Using both a learning rate 77 and retention rate (1 — 77) automatically controls the adaptation 
of the vector with respect to the magnitude of the data vectors, leading to efficiency and stability. 

3.3 Lower-Order Principal Components 

Any component i > 1 not only must satisfy Eq. [8] but must also be constrained to be orthogonal to the 
higher-order components. The residual method generates observations in a complementary space so that 
lower-order eigenvectors can be found by the same update rule Eq. [TO] 

Denote u%(t) as the observation for component i. When i = 1, Ui(i) = u(t). When i > 1, u, is a 
residual vector, which has the "energy" of u(t) from the higher-order components removed. Solving for 
the first PC in this residual space solves for the i-th component overall. To create a residual vector, 
is projected onto Vj to get the energy of that v, is responsible for. Then, the energy-weighted is 
subtracted from to obtain u,+i: 



Together, Eq. 10 and Eq. 11 constitute the CCIPCA technique, which was proven to converge to the 
true components (Zhan gand Weng||2001| l. Yet, due to the residual method, the speed of learning is in line 
with the order: the first PC must be "sufficiently correct" before the second PC can start to learn, and so 
on. 

3.4 MCA Updating 

After using CCIPCA components to generate an approximately whitened signal z, the derivative is ap- 
proximated by z(i) = z(t) — z(t — 1). In this derivative space, the minor components on z are the slow 
features. 

To find the minor component, Peng's MCA update rule (Peng et al. 2007| l is used, 
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w;(t) = 1.5wi(t-l)-riCiWi(t-l) (12) 
- r,[wf{t-l)wi(t-l)]wi{t-l), 



where, for the first minor component, Ci = z(t)z T (t). 

For stability and convergence, the following constraints must be satisfied, 



77A1 < 0.5, (13) 
l|w(0)|| 2 <^, (14) 
w T (0)w* ^ (15) 

where w(0) is the initial feature estimate and w* the true eigenvector associated with the smallest eigen- 
value. Basically, the learning rate must not be too large, and the initial estimate must not be orthogonal to 
the true component. 



3.5 Lower-Order Slow Features 

Sequential addition shifts each observation into a space where the minor component of the current space 
will be the first PC, and all other PCs are reduced in order by one. It does this by adding the scale of the 
first PC to the already estimated slow feature directions. This allows IncSFA to extract more than one slow 
feature in parallel. Sequential addition updates the matrix Cj, Vi > 1 as follows: 



C,(t) = C 4 _!(t) +7(*) (wi-xWw^t)) / (w^wi-xft)) 



(16) 



Note Eq. 16 introduces parameter 7, which must be larger than the largest eigenvalue of E[z(t)z T (t)]. 
To automatically set 7, we compute the greatest eigenvalue of the derivative signal through another CCIPCA 
rule to update only the first PC. Then, let 7 = Ai (t) + e for small e. 



3.6 Link to Hebbian and Anti-Hebbian Updating 

BSFA has been shown to derive slow features that operate like biological grid cell^j from quasi-natural 
image streams, which are recorded from the camera of a moving agent exploring an enclosure ( Fra nzius] 



et al. 20071. In rats, grid cells are found in entorhinal cortex (EC) (Hafting et al. 2005 1, which feeds 



into the hippocampus. Augmenting the BSFA network with an additional competitive learning (CL) layer 



9 

A grid cell has high firing rate when the animal is in certain positions in its closed environment — viewed from above, the pattern 
resembles a grid. 
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derives units similar to place, head-direction, and spatial view cells. Place cells and head-direction cells 
are found in rat hippocampus (O'Keef e and Dostrovsky||1971[|Taube et al.| [T990), while spatial view cells 



are found in primate hippocampus ( Rolls 1999 ). 



Although BSFA results exhibit the above biological link, it is not clear how this technique might be 
realized in the brain. In particular, the space required for a covariance matrix of high-dimensional input 
is too large. IncSFA does not require covariance maatrices, and takes the form of biologically plausible 
Hebbian and anti-Hebbian updating. 

3.1 Hebbian Updating in CCIPCA 

Hebbian updates of synaptic strengths of some neuron make it more sensitive to expected input activa- 
tions payan and Abbott||20()T] >: 

v i- v + t] g(v, u) u, (17) 

where u represents pre-synaptic (input) activity, and g post-synaptic activity (a function of similarity be- 
tween synaptic weights v and input potentials u). The basic Eq. [TTJrequires additional care (e.g., normal- 
ization of v) to ensure stability during updating. To handle this in one step, learning rate 77 and retention 
rate 1 — 77 can be used, 

v <— (1 — r])v + 77 g{y, u) u. (18) 

where < 77 < 1. With this formulation, Eq. [TUJis Hebbian, where the post-synaptic activity is the 
normalized response g(y, u) = '|j ^ — - and the presynaptic activity is the input u^. 

3.2 Anti-Hebbian Updating in Peng's MCA 

The general form of anti-Hebbian updating simply results from flipping the sign in Eq. [T7] In IncSFA 
notation: 

w 4— w — 77 g(w, z) z. (19) 
To see the link between Peng's MCA updating and the anti-Hebbian form, in the case of the first MC, 



we note Eq. 12 can be rewritten as 
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wi <— 1.5wi — 7] [Ci wi + [w[wj Wi] , (20) 

<— 1.5wi — t] |(z ■ wi) z + (wi • Wi) wi] , (21) 

1.5wi — rj ||wi|| 2 wi — rj ((z • Wi) z) , (22) 

<- (1.5 - J] || wi || 2 ) wi - <q (z ■ wi) z, (23) 

where (z • wi) indicates post-synaptic strength, and z pre-synaptic strength. 

When dealing with nonstationary input, as we do in IncSFA due to the simultaneously learning CCIPCA 
components, it is acceptabla^l to normalize the magnitude of the slow feature vectors: w$ <— Wj/||wj||. 



Normalization ensures non-divergence (see Section 4. 1 1. If we normalize, Eq. 20 can be rewritten in the 
even simpler form 

wi <- (1 - r?)wi - r)(b ■ wi) z, (24) 
wi wi/||wi|| (25) 

an even more basic anti-Hebbian updating with retention rate and learning rate. Now, for all other slow 
features i > 1, the update can be written so sequential addition shows itself to be a lateral competition 
term: 

/ \ 
w, <- (1-J7)wj-J7 (z-Wj)z + 7 ^(wj ■ w,)wj . (26) 

4 IncSFA Algorithm 

Now we can present the algorithm for a single IncSFA unit. For each time step t = 0, 1, . . .: 

1. Sense: Grab the current raw input as vector x(t). 

2. Non-Linear Expansion: (optionally) Generate an expanded signal x(t) with / components, e.g. for 
a quadratic expansion: 

x(i) = ...^i^.i^tJ.iiftJiaW.-.^W] ( 27 ) 



J Peng: personal communication. 
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3. Mean Estimation and Subtraction: The signal must be centered (zero mean). This can be done 
incrementally if needed. If t = 0, set x(t) = x(0). Otherwise, update mean vector estimate x(t): 

x(t) = (l-»7)x(t-l) + »7x(t). (28) 

4. Variance Estimation and Normalization: (optionally) The variance of the signal can be normal- 
ized. To do so incrementally, the variance estimates a = (<7i, ...,<tj) are updated: 

(Ji{t) - (1 - 77) <Ti(t - 1) + 77 - Si(t)) 2 , Vz (29) 

and normalize each component's variance by dividing by the estimate. 

For the following steps, u(i) is the processed signal, which has zero mean and unit variance. 

5. CCIPCA: Update estimates of the most significant K principal components of u, where K < I: 

(a) If t < K, initialize v t (t) = u(i). 



(b) Otherwise do for j = 1, 2, K: Let Ui(t) = u(t); execute CCIPCA equations 10 and 11 



6. Whitening and Dimensionality Reduction: Let V(t) contain the normed estimates of the K 
principal components, ordered by estimated eigenvalue, and create diagonal matrix D(t), where 
A,» = l/V^) 5 Vi < ^- Then ' z (0 = V(f)D(f)u(/j). 

7. Derivative Signal: As a forward difference approximation of the derivative, let z(t) = z(i) — z(t — 
I)- 

8. Extract First Principal Component: Use CCIPCA to update the first PC of z (to set sequential 
addition parameter j(t)). 

9. Update Slowness Measure: The slowness measure of the signal i is computed and updated incre- 
mentally to automatically set the learning rate for MCA. 

10. Slow Features: Update estimates of the least significant J PCs of z, where J < K: 

(a) If t < J, initialize w f = i(t). 

(b) Otherwise, let Ci(t) = z(t)z T (t), and for each i = 1. J, execute incremental MCA updates 



in equation 



26 



11. Normalize Slow Feature Estimates: (optionally, for stability) Each Wj Wj/||wj 

12. Output: y(t) = z T (t)W(t) is the SFA output. 
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4.1 Convergence of IncSFA 

It is clear that if whitened signal z is drawn from a stationary distribution, the MCA convergence proof ( jPengj 



et al. 2007 1 applies. But typically the whitening matrix is being learned simultaneously. In this early stage, 
while the CCIPCA vectors are learning, care must be taken to ensure that the slow feature estimates will 
not diverge. 

It was shown that for any initial vector w(0) within the set S, 

S = jw(t)|w(f) G TZ K and ||w(t)|| 2 < i-j , (30) 

will remain in S throughout the dynamics of the MCA updating. ||w|| must be prevented from getting 
too large until the whitening matrix is close to accurate. With respect to lower-order slow features, there 
is additional dependence on the sequential addition technique, parameterized by j(t) = Ai(t) + e. This 
7(i) also needs time to estimate a close value to the first eigenvalue Ai. Before these estimates become 
reasonably accurate, the input can knock the vector out of S. 

In practice, normalization of w after each update was found to be the most useful. If ||w(0)|| = 1 
then any learning rate 77 < 0.5 ensures non-divergence. Another applicable tactic is clipping. If the signal 
z is thresholded, e.g., from -5 to 5, the potential effect of outliers is controlled. A third tactic is to use a 
gradually increasing MCA learning rate. 

Even if w remains in S, the additional constraint w T (0)w* 7^ is needed for the convergence proof. 
But this is an easy condition to meet, as it is unlikely that any w(i) will be exactly orthogonal to the true 
feature. In practice, it may be advisable to add a small amount of noise to the MCA update. But we did not 
find this to be necessary. 



As for CCIPCA: If the standard conditions on learning rate (Papoulis et al. 19651 (including con- 
vergence at zero), the first stage components will converge to the true PCs, leading to a "nearly-correct" 
whitening matrix in reasonable time. So, if x is stationary, the slow feature estimates are likely to become 
quite close to the true slow features in a reasonable amount of updates. 

In open-ended learning, convergence is not desired. Yet by using a learning rate that is always nonzero, 
the stability of the algorithm is reduced. This corresponds to the well-known stability-plasticity dilemma (Gross- 
bergl[T980) . 
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4.2 Setting Learning Rates 

,4 



In CCIPCA, if 77 = j, Eq. 10 will be the most efficient estimatoQof the principal component. But a 
learning rate of l/t is spatiotemporally optimal if every sample from t = 1, 2, 00 is drawn from the same 
distribution, which will not be the case for the lower-order components, and in general for autonomous 
agents. We use an amnesic averaging technique, where the weights of old samples diminuish over time. 
Amnesic averages remain unbiased estimators of the true PCs. For Eq. 10 E[v(n)] — > E[u], as n — > 00. 



To set the CCIPCA learning rate, (and other learning rates, e.g., for the input average x), we used the 
following three-sectioned amnesic averaging function [i\ 

if t < *i, 

M(*) = ^ c(t-*i)/(*a-*i) if *i < t < ta, ( 31 ) 
c+(t-t 2 )/r if t 2 < t- 
Eq.[3TJcombines optimal updating and plasticity for each feature. It uses three stages, defined by points 
t\ and In the first stage, the learning rate is \. In the second, the learning rate is scaled by c to speed up 
learning of lower-order components. In the third, it changes with t, eventually converging to 1/r. 

Unlike with Peng's MCA, there is no convergence proof for CCIPCA and this type of learning. Instead, 
plasticity introduces an expected error that will not vanish ( |Weng and Zh ang 2006 ). To see this, note that 
any component estimate is a weighted sum of all the inputs: 

t 

v(f) = J>(*)u(t), (32) 



T = l 



where J2t=i = 1- Then, 

E\w(t) - vi 2 = ]r P 2 mH\ 2 = J2 p 2 v Manual) 03) 



gives expected estimation error as a function of number of samples T. Eq. 33 can be used to estimate 
the number of samples needed to get below an acceptable expected error bound, if the signal is stationary. 
Otherwise the process retains the ability to adapt at any future t. This introduces some expected error that is 
linked to the learning rate into the IncSFA whitening process. Our results show that this is not problematic 
for many applications, but merely leads to a slight oscillatory behavior around the true features. 

To prevent divergence while CCIPCA is still learning, we used a slowly rising learning rate for MCA, 
starting from low rji at t — and rising to high ijh at t = T, 



4"The most efficient estimator on average requires the least samples for learning among all unbiased estimators. The sample mean 
is the maximum likelihood estimator (i.e., most efficient unbiased estimator) of the population mean for several distribution types, 
e.g., Gaussian. 
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m + (vh - m) * (tY 

Vh 

Ideally, T is a point in time when whitening has stabilized. 



if t < T, 
if T < t. 



(34) 



The upper bound i]b of permissible rjh is defined by the first condition in Eq. 13 



Vh < Vb 



1 



2A X ' 



(35) 



where Ai is the greatest eigenvalue of the signal. Constant values close to but below the bound can be used 
to achieve faster convergence. 

The algorithm maintains an incremental estimate of intermediate output slowness. This can be used to 
automatically adapt the MCA learning rate to changing statistics of the input stream. Since MCA receives 
a derivative of the whitened input signal, the greatest eigenvalue Ai corresponds to the component that 
changes most rapidly. As a fast approximation, we set 



Ai w maxA(i 1 ) = A(i m ), 



(36) 



where z m is the m th dimension of z, which has maximal temporal variation. The A-value ( 37 1 measures 
temporal variation of the signal x(t). It is given by the mean square of that signal's temporal derivative. 
The smaller the A-value, the slower the variation of the corresponding signal component. 



A(x) = (x(t) 2 ) (37) 

The A-value is related to Wiskott & Sejnowski's (Wiskott and Sejnowski 2002 1 slowness measure of 
the input signal given by 



5(x) 



P 

2^ 



(38) 



The value S for some signal of length P indicates how often a pure sine wave of the same A value 
would oscillate. 



Now, from Eq. 36 and Eq. 38 we have 



A(i m ) cx S(z m ) 2 (39) 

Ai cx S{z m ) 2 (40) 

Since r/t = 5x7, we get 

Vb cx S(z m )- 2 (41) 
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Selecting rjh close to r\b (see 35 1, we can write 

Vh = Vb~ip (42) 

for some arbitrarily small constant ip. 



From Eq. 41 and Eq. 42 we get 



r) h cx S(z m )- 2 (43) 

With a working learning rate rjh and the slowness measure estimate for some input, we can automati- 
cally adapt rjh for a new signal by tracking how its slowness measure changes. 

4.3 Dimensionality Reduction Parameter 

The eigenvectors of x associated with the smallest eigenvalues might represent noise dimensions. Instead 
of passing this typically useless information to our second phase, the small eigenvalue directions can be 
discarded. 

While whitening the /-dimensional input signal, the dimension can be reduced to K < I. K can be 
automatically tuned. A method we found to be successful is to set K such that no more than a certain 
percentage of the previously estimated total data variance (the denominator below) is lost. Let j3 be the 
ratio of total variance to keep (e.g., 0.95), and compute the smallest K such that 

^ Afc(t) > p. (44) 
EjA k (t-l) 

5 Experiments and Results 

Some of our experiments are designed to show that IncSFA derives the same features as batch SFA. Others 
show how IncSFA can work in scenarios where batch SFA is not applicable, and how IncSFA can be 
utilized in high-dimensional video processing applications. Experiments were done either using Python 



(using the MDP toolbox ( T. Zito and Berkes 2008 1) or Matlab. 



5.1 Proof of Concept 

As a basic proof of concept, IncSFA is applied to problem introduced in the original SFA paper ( Wiskott 



and Sejnowski 2002). The input signal is 



xx(t) = sin(i) +cos(ll t) 2 , (45) 
x 2 (t) = cos(ll t), t G [0,2tt], (46) 
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2D Input Signal 




Epoch 



(b) 

Figure 2: Experiment with a simple non-linear input signal. A learning rate of rj = 0.08 is used, (a) Input 
Signal (b) Output RMSE plot (c) Batch SFA output of the first slow feature (d)-(f) IncSFA output at t = 2, 
5, 10 epochs, (g) Batch SFA output of the second slow feature (h)-(j) IncSFA output at t = 2, 5, 10 epochs. 

Both vary quickly over time (see Figure|2|a)). A total of 2, 000 discrete datapoints are used for learning. 
The slowest feature hidden in the signal is yi(t) — Xi(t) — X2(t) 2 = sin(i), and the second is X2(t) 2 . 

Both BSFA and IncSFA extract these features. Figure|2|b) shows the Root Mean Square Error (RMSE) 
of IncSFA signals compared to the BSFA output, over multiple epochs of training. The RMSE at the end 
of 10 epochs is found to be equal to [0.0360, 0.1078, 0.0377] T . 

Figure[2|c) and (g) shows feature outputs of batch SFA, and (to the right) IncSFA outputs at 2, 5, and 
10 epochs. Figures |2|g)-(j) show this comparison for the second feature. 

This result show that it is indeed possible to extract multiple slow features in an online way without 
storing covariance matrices. 

5.2 Extraction of a Driving Force from High Dimensional Input 

A classic slow feature extraction problem involves uncovering the driving force of a dynamic system hidden 
in a very complex signal. Here, a chaotic time series is derived from a logistic map ( |T Zito and Be rkes , 

[2008] ): 



x(t + 1) = (3.6 + 0.13 j(t))x(t) (1 - x(t)), (47) 
which is driven by a slowly varying driving force j(t) made up of two frequency components (5 and 11 
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Input Signal Driving Force Output rmse plot 




Figure 3: Experiment with a chaotic time series derived from a logistic map. A learning rate of r\ = 0.004 
is used, (a) Driving Force (b) Input (c) Output RMSE plot (d) BSFA output of the slowest feature (e)-(g) 
IncSFA output at t = 15, 30, 60 epochs. 



Hz) given by 



j(t) = sin(107rt) + sin(227rf). (48) 

To show the complexity of the signal, figures [3ja) and[3jb) plot the driving force signal j(t) and the 
generated time series x{t), respectively. 

A total of 1,000 discrete datapoints are used. The driving force cannot be extracted linearly, so a 
nonlinear expansion is used — temporal in this case. The signal is embedded in 10 dimensional space using 



a sliding temporal window of size 10 (the TimeFmmesNode from the MDP toolkit (T Zito and Berkes 



2008) is used for this). The signal is then spatially quadratically expanded to generate an input signal with 
65 dimensions. 

Figure [3jc) shows the convergence of IncSFA on the BSFA output, Figure [3jd) BSFA output, and 
Figures ^e)-(g) the outputs of IncSFA at 15, 30 and 60 epochs. The RMSE at 60 th epoch is found to be 
equal to 0.0984. 
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Batch SFA Output 




Unit 2 

(b) 



IncSFA Output 




Unit 2 

(d) 



Figure 4: (a) BSFA output of the first slow feature and (b) the second slow feature (c) IncSFA output of 
the first slow feature and (d) the second slow feature after 50,000 samples with learning rate r] = 0.003 
(figures best viewed in color). 



5.3 Invariant Spatial Coding from Simple Movement Data 

Our simulated agent performs a random walk in a two-dimensional bounded space. Brownian motion is 
used to generate agent trajectories approximately like those of rats. The agent's position p(t) = [x(t) , y(t)} 
is updated by a weighted sum of the current velocity and gaussian white noise, with standard deviation 
v r . The momentum term rn can assume values between zero and one, so that higher values of m lead to 
smoother trajectories and more homogeneous sampling of space in less time. Once the agent is predicted 
to cross the spatial boundaries, the current velocity is halved and an alternative random velocity update 
is generated, until a new valid position is reached. Noise variance v r = [3.0, 2.5] T , mass m = 0.75 and 
50, 000 data points are used for generating the training set. A separate test grid dataset samples positions 
and orientations at regular intervals, and is used for evaluation. 
Here is the used movement paradigm: 

currVel pit) — p(t — 1); 

repeat 

noise <— GaussianW hiteN oise2d() * v r ; 
p(t + 1) <— p(t) + m * currVel + (1 — m) * noise; 
if not isInsideWalkArea(p(t + 1)) : 
currVel <— currVel /2; 
until isInsideWalkArea(p(t + 1)) 

Under this movement paradigm ( [Franzius et al. 2007 1, SFA yields slow feature outputs in the form of 
half-sinusoids, shown in Figure [4] These features collectively encode the agent's x and y position in the 
environment. The first slow feature (Figure |4| a)) is invariant to the agent's x position, the second (Figure 
Qb)) to its y position (y axis horizontal). IncSFA's results (Figures |4^c)-(d)) are close to the ones of the 
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batch version, with an RMSE of [0.0536, 0.0914] T . 

5.4 Feature Adaptation to a Changing Environment 




Figure 5: (a) RMSE of IncSFA's first two output functions with respect to the true functions for original 
signal (epochs 1-59), and switched signal (epochs 60-120). (b) Normalized similarity (direction cosine) 
of the first slow feature to the true first slow feature of the current process, over 25 independent runs, (c) 
Normalized similarity of the second incremental slow feature. 

The purpose of this experiment is to illustrate how IncSFA's features adapt to a sudden shift in the input 
process. The input used is the same signal as in Experiment #1, but broken into two partitions. At epoch 
60, the two input lines x\ and x% are switched such that the x\ signal suddenly carries what x<i used to, and 
vice versa. We wish to show that IncSFA can first learn the slow features of the first partition, then is able 
to adapt to learn the slow features of the second partition. 

The signal is sampled 500 times per epoch. The CCIPCA learning rate parameters, also used to set the 
learning rate of the input average x, were set to t\ = 20, £2 = 200, c = 4, r = 5000. The MCA learning 
rate is 77 = 0.01. 

Results of IncSFA are shown in Fig. [5] demonstrating successful adaptation. To measure convergence 
accuracy, we use the direction cosine ( jChatterjee et al. 20001 between the estimated feature w(i) and true 
(unit length) feature w*, 

lw T m ■ w*i 

DirectionCosineit) = „ 1 „ ' , (49) 

ll wi (*) II ' ll w *ll 

The direction cosine equals one when the directions align (the feature is correct) and zero when they are 
orthogonal. 

BSFA results are shown in Fig. [6] The first batch feature catches the meta-dynamics and could actually 
be used to roughly sense the signal switch. However, the dynamics within each partition are not extracted. 
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Figure 6: Outputs of first two slow features, from epoch 59 through 61, extracted by batch SFA over the 
input sequence. 

5.5 Recovery from Outliers 



IncSFA First Feature's Response BSFA First Feature's Response 




Figure 7: First output signals of IncSFA and BSFA on the simple signal with a single outlier. 

Again, the learning rate setup and basic signal from the previous experiment is used, over 150 epochs, 
with 500 samples per epoch. A single outlier point is inserted: xi(100) = X2(100) = 2000. Figure [7] 
shows the first output signal of BSFA and IncSFA, showing that the one outlier point at time 100 (out of 
75,000) is enough to corrupt the first feature of BSFA, whereas IncSFA recovers. 



The relative lack of sensitivity of IncSFA to outliers is shown in a real-world experiment ( Kompella 



et al. 201 lb I, in which a person moves back and forth in front of a stable camera. At only one point in 
the training sequence, a door in the background is opened, and the BSFA hierarchical network's first slow 
feature became sensitive to this event. Yet, the AutoIncSFA network's first slow feature encodes the relative 
distance of the moving interactor. 



5.6 High-Dimensional Video with Linear IncSFA 

IncSFA's scalability is tested with an image sequence of dimension 41x41x3 (color images: see Fig.JSJa)). 
The agent is located in the middle of a square room with four complex-textured walls. In each episode, 
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Figure 8: (a) Stream of 90 41 x 41 x 3 images as the agent completes one turn (360 degrees). Viewed 
row-wise, left to right, (b) Data projected onto the first three features learned by IncSFA. This gives a 
compact encoding of the agent's state. 

starting from a different orientation, the agent rotates slowly (4 degree shifts from one image to the next) by 
360 degrees. At any time, a slight amount of Gaussian noise is added to the image (cr = 8). The agent has 
a video input sensor, and the sequence of image frames with 5, 043 dimensions is fed into a linear IncSFA 
directly. 

To reduce computation time, only the 40 most significant principal components are computed by 
CCIPCA, using learning rate parameters t\ — 20, — 200, c = 4, r = 5000. Computation of the 
covariance matrix and its full eigendecomposition (including over 5000 eigenvectors and eigenvalues) is 
avoided. On the 40 x 40 whitened difference space, only the first 5 slow features are computed via MCA 
and sequential addition. 500 epochs through the data took approximately 15 minutes using Matlab on a 
machine with an Intel i3 CPU and 4 GB RAM. 

The result of projecting the (noise-free) data onto the first three slow features are shown in Fig. |HJb). 
A single linear IncSFA has incrementally compressed this high-dimensional noisy sequence to a nearly 
unambiguous compact form, learning to ignore the details at the pixel level and attend to the true cyclical 
nature underlying the image sequence. A few subsequences have somewhat ambiguous encodings, because 
certain images associated with slightly different angles are very similar. 
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5.7 High-Dimensional Video and Episodic Learning 

"Real-world" learning systems might operate in series of several episodes of interactions with the environ- 
ment. IncSFA can be readily extended to episodic tasks, with a minor modification: The derivative signal, 
which is computed as a difference over a single time step, is simply not computed for the starting sample 
of each episode. The first data point in each episode is used for updating the PCs, but not the slow feature 
vectors. 

Here we present results obtained through a robot's episodic interactions with objects in its field of view. 
Two plastic cups are placed in the iCub robot's field of view. The robot performs motor babbling in one 
joint using a movement paradigm of Franzius et al. During the course of babbling, it happens to topple the 
cups, in one of two possible orders. The episode ends a short time after it has knocked both down. A new 
episode begins with the cups upright again and the arm in the beginning position. A total of 50 separate 
episodes were used as training data. 

Linear IncSFA is used on the entire 80 x 60 (grayscale) image. Only the 20 most significant principal 
components are computed by CCIPCA, using learning rate parameters t\ = 20, t% = 200, c = 2, r = 
10000. Only the first 5 slow features are computed via MCA and sequential addition, with learning rate 
0.001. The MCA vectors are normalized after each update during the first 10 epochs, but not thereafter 
(for faster convergence). Each of 25 different trials was over 400 randomly-selected (of the 50 possible) 
episodes. 

Results are shown in Fig. [9] We measured the slowness of the features on three "testing" episodes, after 
each episode of training. The upper left plot shows that all five features get slower over the episodes. After 
training completes, we can embed the data in a lower dimension with respect to the learned features. The 
embedding of 20 episodes are shown with respect to the first two PCs as well as the first two slow features. 
Since the cups being toppled or upright are the slow events in the scene, IncSFA's encoding is keyed on 
the object's state (toppled or upright). PCA does not find such an encoding, being much more sensitive to 
the arm. Since these events occurs once within each episode, BSFA cannot be used to learn these features. 



Figure 10 shows the average mutual direction cosine between non-identical pairs of slow features, and we 
can see the features quickly become nearly decorrelated. 

Such clear object-specific low-dimensional encoding, invariant to the robot's arm position, is useful, 
greatly facilitating training of a subsequent regressor or reinforcement learner. A video of the experimental 
result can be found at |http : / /www . idsia . ch/ -luciw/IncSFAArm/ IncSFAArm. html| 

5.8 Hierarchical IncSFA 

Deep networks composed of multiple stacked IncSFA nodes, each sensitive to only a small part of the 
input (i.e., receptive fields), can be used for processing high-dimensional image streams in a biologically 
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Slowness Over Episodes (20 Dimensional Discontinuous Input) 
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Figure 9: Experimental result of IncSFA on episodes where the iCub knocks down two cups via motor 
babbling on one joint. Upper left: The average slowness of the five features at each episode. Upper right: 
after training, several episodes (each episode is an image sequence where the cups are eventually both 
knocked down) are embedded in the space spanned by the first two PCs. Lower right: the same episodes 
are embedded in the space spanned by the first two slow features. We show some example images and 
where they lie in the embedding. The cluster in the upper right (A) represents when both cups are upright. 
When the robot knocks down the blue cup first, it moves to the cluster in the upper left (Bl). If it instead 
knocks down the brown cup, it moves to the lower right cluster (B2). Once it knocks down both cups, it 
moves to the lower left area (C). 
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Figure 10: Average slow feature similarity over episodes. 
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Figure 11: Example Hierarchical IncSFA Architecture, also showing the structure of an IncSFA node, 
which contains a linear IncSFA unit followed by nonlinear expansion followed by another linear IncSFA 
unit. 



plausible way. The computational reason for doing this is that very high-dimensional signals correspond to 
large search spaces, and hierarchical setups breaking up the signal can reduce the search burden. And using 
receptive fields reduces the number of necessary lower-order PCs that have to be computed by CCIPCA, 
which should speed the learning. 



Figure 1 1 shows an example deep network, motivated by the human visual system and based on the one 



specified by Franzius et al. ( |Franzius et al.| [2007 >. The network is made up of a converging hierarchy of 



Technical Report No. IDSIA-07-1 1 



26 



layers of IncSFA nodes, with overlapping rectangular receptive fields. Each IncSFA node finds the slowest 
output features from its input within the subspace of all monomials (e.g., of degree two if a quadratic 
expansion is used) of the node's inputs. 

We feed IncSFA with images from a high-dimensional video stream generated by the iCub simula- 
tor ( V. Tikhanoff and Nori 2008| l, an OpenGL-based software specifically built for the iCub robot. Our 
experiment mimics the robot observing a moving interactor agent, which in the simulation takes the form 
of a rectangular flat board moving back and forth in depth over the range [1, 3] (meters) in front of the robot, 
using a movement paradigm similar to the one discussed in Section|53] Figure[T2|a) shows the experimen- 



tal setup in the iCub simulator. Figure 12 'b) shows a sample image from the dataset. 20, 000 monocular 
images are captured from the robot's left eye and downsampled to 83x100 pixels (input dimension of 
8, 300). 

A three-layer IncSFA network is used to encode the images. Each SFA node operates on a spatial 
receptive field of the layer below. The first layer uses 15 x 19 nodes, each with 10 x 10 image patch 
receptive field and a 5 pixel overlap. Each node on this layer develops 10 slow features. The second layer 
uses 4x5 nodes, each having a 5 x 5 receptive field, and developing 5 slow features. The third layer 
uses two nodes, one sensitive to the top half, the other sensitive to the bottom half (5 slow features). The 
forth layer uses a single node and a single slow feature. The network is trained layer-wise from bottom to 
top, with the lower layers frozen once a new layer begins its training. The CCIPCA output of all nodes 
is clipped to [—5,5], to avoid any outliers that may arise due to close-to-zero eigenvalues in some of the 
receptive fields that contain unchanging stimuli. Each IncSFA node is trained individually, that is, there is 
no weight sharing among nodes. 



For comparison, a batch SFA hierarchical network was also trained on this data. Figures 12 show BSFA 
and IncSFA outputs. The expected output is of the form of a sinusoid extending over the range of board 
positions. IncSFA gives a slightly noisy output, probably due to the constant dimensionality reduction 
value for all units in each layer of the network, selected to maintain a consistent input structure for the 
subsequent layer; hence some units with eigenvectors corresponding to very small eigenvalues emerge in 
the first stage, with receptive fields observing comparatively few input changes, thus slightly corrupting the 
whitening result, and adding small fluctuations to the overall result. 

Finally, we evaluate how well the IncSFA feature codes for distance. A supervised quadratic regressor 
is trained with ground truth labels on 20% of the dataset, and tested on the other 80%, to measure the 
quality of features for some classifier or reinforcement learner using them (see RMSE plot). Hierarchical 
IncSFA derives the driving forces from a complex and continuous input video stream in a completely online 
and unsupervised manner. 
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Figure 12: (a) Experimental Setup: iCub Simulator (b) Sample image from the input dataset (c) Batch-SFA 
output (d) IncSFA output (?/ = 0.005) 

6 Conclusions 

Our novel Incremental Slow Feature Analysis technique solves SFA problems incrementally without stor- 
ing covariance matrices. IncSFA's covariance-free Hebbian and anti-Hebbian updates add biological plau- 
sibility to SFA itself. While batch SFA cannot handle certain open-ended uncontrolled settings, IncSFA 
can. This makes it a promising tool for learning autonomous robots. Future work will study online learning 
controllers whose experiments actively create data exhibiting novel but learnable regularities measured by 
improvements of emerging slow features, in line with the formal theory of curiosity (Schmidhuber 20 1 0) > . 
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